CHARACTERIZING L2BOOSTING

By John Ehrlinger and Hemant Ishwaran 
Cleveland Clinic and University of Miami 

We consider L2Boosting, a special case of Friedman's generic boosting algorithm applied to linear regression under L2-loss. We study L2Boosting for an arbitrary regularization parameter and derive an exact closed form expression for the number of steps taken along a fixed coordinate direction. This relationship is used to describe L2Boosting's solution path, to describe new tools for studying its path, and to characterize some of the algorithm's unique properties, including active set cycling, a property where the algorithm spends lengthy periods of time cycling between the same coordinates when the regularization parameter is arbitrarily small. Our fixed descent analysis also reveals a repressible condition that limits the effectiveness of L2Boosting in correlated problems by preventing desirable variables from entering the solution path. As a simple remedy, a data augmentation method similar to that used for the elastic net is used to introduce L2-penalization and is shown, in combination with decorrelation, to reverse the repressible condition and circumvent L2Boosting's deficiencies in correlated problems. In itself, this presents a new explanation for why the elastic net is successful in correlated problems and why methods like LAR and lasso can perform poorly in such settings.

1. Introduction. Given data $\{y_i, x_i\}_{i=1}^n$, where $y_i$ is the response and $x_i = (x_{i,1}, \ldots, x_{i,p})^T \in \mathbb{R}^p$ is the p-dimensional covariate, the goal in many analyses is to approximate the unknown function $F(x) = E(y|x)$ by minimizing a specified loss function $L(y, F)$ [a common choice is L2-loss, $L(y, F) = (y - F)^2/2$]. In trying to estimate F, one strategy is to make use of a large system of possibly redundant functions $\mathcal{H}$. If $\mathcal{H}$ is rich enough, then it is reasonable to expect F to be well approximated by an additive expansion of the form

    $F(x; \{\beta_k, \alpha_k\}_1^K) = \sum_{k=1}^K \beta_k h(x; \alpha_k)$,

where $h(x; \alpha) \in \mathcal{H}$ are base learners parameterized by $\alpha \in \Theta$. To estimate F, a joint multivariable optimization over $\{\beta_k, \alpha_k\}_1^K$ may be used. But such an optimization may be computationally slow or even infeasible for large dictionaries. Overfitting may also result. To circumvent this problem, iterative descent algorithms are often used.

One popular method is the gradient descent algorithm described by Friedman (2001), closely related to the method of "matching pursuit" used in the signal processing literature [Mallat and Zhang (1993)]. This algorithm is applicable to a wide range of problems and loss functions, and is now widely perceived to be a generic form of boosting. For the mth step, m = 1, ..., M, one solves

(1.1)    $\rho_m = \arg\min_{\rho} \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + \rho h(x_i; \alpha_m))$,

where

(1.2)    $\alpha_m = \arg\min_{\alpha} \sum_{i=1}^n [g_m(x_i) - h(x_i; \alpha)]^2$

identifies the closest base learner to the gradient $g_m = (g_m(x_1), \ldots, g_m(x_n))^T$ in L2-distance, where $g_m(x_i)$ is the gradient evaluated at the current value $F_{m-1}(x_i)$, and is defined by

    $g_m(x_i) = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = -L'(y_i, F_{m-1}(x_i))$.

The mth update for the predictor of F is

    $F_m(x) = F_{m-1}(x) + \nu \rho_m h(x; \alpha_m)$,

where $0 < \nu \le 1$ is a regularization (learning) parameter.

In this paper, we study Friedman's algorithm under L2-loss in linear regression settings assuming an n x p design matrix $X = [X_1, \ldots, X_p]$, where $X_k = (x_{1,k}, \ldots, x_{n,k})^T$ denotes the kth column. Here $X_k$ represents the kth base learner; that is, $h(x_i; k) = x_{i,k}$ where $k = \alpha$ and $\Theta = \{1, \ldots, p\}$. It is well known that under L2-loss the gradient simplifies to the residual $g_m(x_i) = y_i - F_{m-1}(x_i)$. This is particularly attractive for a theoretical treatment as it allows one to combine the line-search (1.1) and the learner-search (1.2) into a single step because the L2-loss function can be expressed as $L(y_i, F_{m-1}(x_i) + \rho x_{i,k}) = (g_m(x_i) - \rho x_{i,k})^2/2$. The optimization problem becomes

    $\{\rho_m, k_m\} = \arg\min_{\rho,\ 1 \le k \le p} \|g_m - \rho X_k\|^2$.
Algorithm 1 L2Boosting
1: Initialize $F_0(x_i) = 0$ for i = 1, ..., n
2: for m = 1 to M do
3:   $k_m = \arg\max_{1 \le k \le p} |X_k^T g_m|$, where $g_m = y - F_{m-1}$
4:   $F_m = F_{m-1} + \nu \rho_m X_{k_m}$, where $\rho_m = X_{k_m}^T g_m$
5: end for



It is common practice to standardize the response by removing its mean 
which eliminates the issue of whether an intercept should be included as 
a column of X. It is also common to standardize the columns of X to have 
a mean of zero and squared-length of one. Thus, throughout, we assume the 
data is standardized according to 

(1.3)    $\sum_{i=1}^n y_i = 0, \quad \sum_{i=1}^n x_{i,k} = 0, \quad \sum_{i=1}^n x_{i,k}^2 = 1, \quad k = 1, \ldots, p$.

The condition $\sum_{i=1}^n x_{i,k}^2 = 1$ leads to a particularly useful simplification:

    $\rho_m = X_{k_m}^T g_m, \quad k_m = \arg\max_{1 \le k \le p} |X_k^T g_m|$.

Thus, the search for the most favorable direction is equivalent to determining the largest absolute value $|X_k^T g_m|$. We refer to $X_k^T g_m$ as the gradient-correlation for k. We shall refer to Friedman's algorithm under the above settings as L2Boosting. Algorithm 1 provides a formal description of the algorithm [we use $F_{m-1} = (F_{m-1}(x_1), \ldots, F_{m-1}(x_n))^T$ for notational convenience].
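As a rough illustration of Algorithm 1, the following Python sketch applies the componentwise update to a toy design with three observations. The function name `l2boost`, the list-of-columns layout and the toy data are assumptions made for this example; they do not come from the paper.

```python
def l2boost(X_cols, y, nu, M):
    """Sketch of Algorithm 1 for columns standardized as in (1.3)."""
    F = [0.0] * len(y)                                    # line 1: F_0 = 0
    selected = []
    for _ in range(M):                                    # line 2
        g = [yi - fi for yi, fi in zip(y, F)]             # residual g_m = y - F_{m-1}
        rho = [sum(x * gi for x, gi in zip(col, g)) for col in X_cols]
        k = max(range(len(X_cols)), key=lambda j: abs(rho[j]))     # line 3
        F = [fi + nu * rho[k] * x for fi, x in zip(F, X_cols[k])]  # line 4
        selected.append(k)
    return selected, F


s = 2 ** 0.5
# Two mean-zero, unit-length columns with sample correlation 1/2 (hypothetical data).
X_cols = [[1 / s, -1 / s, 0.0], [1 / s, 0.0, -1 / s]]
y = [3 / s, -2 / s, -1 / s]                               # y = 2*X_1 + X_2, mean zero
selected, F = l2boost(X_cols, y, nu=0.1, M=8)
print(selected)  # [0, 0, 0, 0, 0, 1, 0, 1]
```

Indices are 0-based, so the printed path is five steps along the first column followed by alternation between the two columns.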

Properties of stagewise algorithms similar to L2Boosting have been studied extensively under the assumption of an infinitesimally small regularization parameter. Efron et al. (2004) considered a forward stagewise algorithm $FS_\epsilon$, and showed under a convex cone condition that the Least Angle Regression (LAR) algorithm yields the solution path for $FS_0$, the limit of $FS_\epsilon$ as $\epsilon \to 0$. This shows that $FS_\epsilon$, a variant of boosting, and the lasso [Tibshirani (1996)] are related in some settings. Hastie et al. (2007) showed in general that the solution path of $FS_0$ is equivalent to the path of the monotone lasso.

However, much less work has focused on stagewise algorithms assuming an arbitrary learning parameter $0 < \nu \le 1$. An important exception is Bühlmann (2006) who studied L2Boosting with componentwise linear least squares, the same algorithm studied here, and proved consistency for arbitrary $\nu$ under a sparsity assumption where p can increase at an exponential rate relative to n. As pointed out in Bühlmann (2006), the $FS_\epsilon$ algorithm studied by Efron et al. (2004) bears similarities to L2Boosting. It is identical to Algorithm 1, except for line 4, where $\epsilon$ is used in place of $\nu$ and

    $F_m = F_{m-1} + \epsilon \delta_m X_{k_m}$, where $\delta_m = \mathrm{sgn}[\mathrm{corr}(g_m, X_{k_m})]$.

Thus, $FS_\epsilon$ replaces the gradient-correlation $\rho_m$ with the sign of the gradient-correlation $\delta_m$. For infinitesimally small $\nu$ this difference appears to be inconsequential, and it is generally believed that the two limiting solution paths are equal [Hastie (2007)]. In general, however, for arbitrary $0 < \nu \le 1$, the two solution paths are different. Indeed, Bühlmann (2006) indicated certain unique advantages possessed by L2Boosting. Other related work includes Bühlmann and Yu (2003), who described a bias-variance decomposition of the mean-squared-error of a variant of L2Boosting.

1.1. Outline and contributions. In this paper, we investigate the properties of L2Boosting assuming an arbitrary learning parameter $0 < \nu \le 1$. During L2Boosting's descent along a fixed coordinate direction, a new coordinate becomes more favorable when it becomes closest to the current gradient. But when does this actually occur? We provide an exact simple closed form expression for this quantity: the number of iterations to favorability (Theorem 2 of Section 2). This core identity is used to describe L2Boosting's solution path (Theorem 3), to introduce new tools for studying its path and to study and characterize some of the algorithm's unique properties. One of these is active set cycling, a property where the algorithm spends lengthy periods of time cycling between the same coordinates when $\nu$ is small (Section 3).

Our fixed descent identity also reveals how correlation affects L2Boosting's ability to select variables in highly correlated problems. We identify a repressible condition that prevents a new variable from entering the active set, even though that variable may be highly desirable (Section 4). Using a data augmentation approach, similar to that used for calculating the elastic net [Zou and Hastie (2005)], we describe a simple method for adding L2-penalization to L2Boosting (Section 5). In combination with decorrelation, this reverses the repressible condition and improves L2Boosting's performance in correlated problems. Because L2Boosting is known to approximate forward stagewise algorithms for arbitrarily small $\nu$, it is natural to expect these results to apply to algorithms such as LAR and lasso, and thus our results provide a new explanation for why these algorithms may perform poorly in correlated settings and why methods like the elastic net, which makes use of L2-penalization, are more adept in such settings. All proofs in this manuscript can be found in the supplemental article [Ehrlinger and Ishwaran (2012)].

2. Fixed descent analysis. To analyze L2Boosting we introduce the following notation useful for describing its solution path. Let $\{l_1, \ldots, l_{M^*}\}$ be the $M^* \le M$ nonduplicated values in order of appearance of the selected coordinate directions $B_M = \{k_1, \ldots, k_M\}$. We refer to these ordered, nonduplicated values as critical directions of the path. For example, if $B_M = \{5, 5, 5, 3, 5, 1, 4, 4, 5\}$, the critical directions are $\{5, 3, 5, 1, 4, 5\}$ and $M^* = 6$. To formally describe the solution path we introduce the following nomenclature.

Fig. 1. Solution path for L2Boosting where $B_M = \{5, 5, 5, 3, 5, 1, 4, 4, 5\}$ (horizontal axis: critical points $S_0, S_1, \ldots, S_6$). The $M^* = 6$ critical directions are $(l_r)_1^6 = (5, 3, 5, 1, 4, 5)$ with critical descent step lengths $(L_r)_1^6 = (3, 1, 1, 1, 2, 1)$ and critical points $(S_r)_1^6 = (3, 4, 5, 6, 8, 9)$.

Definition 1. The descent length along a critical direction $l_r$ is denoted by $L_r$. The critical point $S_r$ is the step number at which the descent along $l_r$ ends. Thus, following step $S_{r-1}$, the descent is along $l_r$ for a total of $L_r$ steps, ending at step $S_r$.

The set of values $\{(l_r, L_r, S_r)\}_1^{M^*}$ can be used to formally describe the solution path of L2Boosting: the algorithm begins by descending along direction $l_1$ (the first critical direction) for $L_1$ steps, after which it switches to a descent along direction $l_2$ (the second critical direction) for a total of $L_2$ steps. This continues with the last descent along $l_{M^*}$ (the final critical direction) for a total of $L_{M^*}$ steps. See Figure 1 for illustration of the notation.
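The bookkeeping behind this notation amounts to a run-length encoding of the selected directions. The sketch below recovers the critical directions, descent lengths and critical points for the example sequence $\{5, 5, 5, 3, 5, 1, 4, 4, 5\}$; the helper name `critical_path` is an assumption introduced for illustration.

```python
from itertools import accumulate, groupby


def critical_path(selected):
    """Collapse selected directions k_1..k_M into (l_r), (L_r) and (S_r)."""
    runs = [(k, len(list(group))) for k, group in groupby(selected)]
    directions = [k for k, _ in runs]       # critical directions l_r
    lengths = [length for _, length in runs]  # descent lengths L_r
    points = list(accumulate(lengths))      # critical points S_r
    return directions, lengths, points


B_M = [5, 5, 5, 3, 5, 1, 4, 4, 5]
directions, lengths, points = critical_path(B_M)
print(directions)  # [5, 3, 5, 1, 4, 5]
print(lengths)     # [3, 1, 1, 1, 2, 1]
print(points)      # [3, 4, 5, 6, 8, 9]
```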

A key observation is that L2Boosting's behavior along a given descent is deterministic except for its descent length (number of steps). If we could determine the descent length, a quantity we show is highly amenable to analysis, then an exact description of the solution path becomes possible as L2Boosting can be conceptualized as a collection of such fixed paths.

Imagine then that we are at step m' of the algorithm and that in the following step a new critical direction k is formed. Let us study the descent along k for the next m = 1, ..., M' steps. Thus, in the mth step of the descent along k, the predictor is

    $F_{k,m} = F_{k,m-1} + \nu \rho_{k,m} X_k$, where $\rho_{k,m} = X_k^T(y - F_{k,m-1})$.

Consider then Algorithm 2 which repeatedly boosts the predictor along the kth direction for a total of M' steps.

Algorithm 2 L2Boosting (Fixed direction, k)
2: for m = 1 to M' do
3:   $F_{k,m} = F_{k,m-1} + \nu \rho_{k,m} X_k$, where $\rho_{k,m} = X_k^T(y - F_{k,m-1})$
4: end for

The following result states a closed form solution for the m-step predictor 
of Algorithm 2 and will be crucial to our characterization of L2Boosting. 

Theorem 1. $F_{k,m} = F_{k,0} + \nu_m \rho_{k,1} X_k$, where $\nu_m = 1 - (1 - \nu)^m$ and $\rho_{k,1} = X_k^T(y - F_{k,0})$.

Theorem 1 shows that taking a single step with learning parameter $\nu_m$ yields the same limit as taking m steps with the smaller learning parameter $\nu$. The result also sheds insight into how $\nu$ slows the descent relative to stagewise regression. Notice that the m-step predictor can be written as

    $F_{k,m} = \underbrace{F_{k,0} + \rho_{k,1} X_k}_{\text{stagewise}} - \underbrace{(1 - \nu)^m \rho_{k,1} X_k}_{\text{slow learning}}$.

The first term on the right is the predictor from a greedy stagewise step, while the second term represents the effect of slow-learning. This latter term is what slows the descent relative to a greedy step. When $m \to \infty$ this term vanishes, and we end up with stagewise fitting, that is, the predictor obtained from a single step with learning parameter equal to 1.
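Theorem 1 is easy to check numerically. The sketch below takes m small steps along one unit-length column and compares the result with a single step of size $\nu_m = 1 - (1 - \nu)^m$. The function names and the toy data are assumptions made for this example.

```python
def boost_along(x_k, y, F0, nu, m):
    """Algorithm 2 style descent: m small steps along the unit-length column x_k."""
    F = list(F0)
    for _ in range(m):
        rho = sum(x * (yi - fi) for x, yi, fi in zip(x_k, y, F))
        F = [fi + nu * rho * x for fi, x in zip(F, x_k)]
    return F


def single_step(x_k, y, F0, nu, m):
    """Theorem 1: one step with learning parameter nu_m = 1 - (1 - nu)^m."""
    nu_m = 1.0 - (1.0 - nu) ** m
    rho_1 = sum(x * (yi - fi) for x, yi, fi in zip(x_k, y, F0))
    return [fi + nu_m * rho_1 * x for fi, x in zip(F0, x_k)]


s = 2 ** 0.5
x_k = [1 / s, -1 / s, 0.0]          # mean zero, squared length one (hypothetical)
y = [3 / s, -2 / s, -1 / s]         # mean zero response (hypothetical)
F0 = [0.0, 0.0, 0.0]
many = boost_along(x_k, y, F0, nu=0.1, m=25)
one = single_step(x_k, y, F0, nu=0.1, m=25)
max_gap = max(abs(a - b) for a, b in zip(many, one))
print(max_gap)  # differs from zero only by floating-point rounding
```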

2.1. Directional change in the descent. Theorem 1 shows how to take 
a large boosting step in place of many small steps, but it does not indi- 
cate how many steps must be taken along k before a new variable enters 
the solution path. If this were known, then the entire k-descent could be
characterized in terms of a single step.

To determine the descent length, suppose that L2Boosting has descended 
along k for a total of m steps. At step m + 1 the algorithm must decide 
whether to continue along k or to select a new direction j. To determine 
when to switch directions, we introduce the following definition. 

Definition 2. A direction j is said to be more favorable than k at step m + 1 if $|\rho_{k,m}| \ge |\rho_{j,m}|$ and $|\rho_{k,m+1}| < |\rho_{j,m+1}|$. Thus, if j is more favorable at m + 1, the descent switches to j for step m + 1.
To determine when j becomes more favorable, it will be useful to have a closed form expression for $\rho_{k,m+1}$ and $\rho_{j,m+1}$. By Theorem 1,

    $\rho_{j,m+1} = X_j^T(y - F_{k,m}) = \rho_{j,1} - \nu_m \rho_{k,1} R_{j,k}$,

where $R_{j,k} = X_j^T X_k$. Setting j = k yields $\rho_{k,m+1} = (1 - \nu)^m \rho_{k,1}$. Therefore, $|\rho_{k,m+1}| < |\rho_{j,m+1}|$ if and only if

    $(1 - \nu)^{2m} \rho_{k,1}^2 < (\rho_{j,1} - \nu_m \rho_{k,1} R_{j,k})^2$.

Dividing throughout by $\rho_{k,1}^2$, with a little bit of rearrangement, this becomes

(2.1)    $(1 - \nu)^{2m} < [(1 - \nu)^m R_{j,k} + (d_{j,k} - R_{j,k})]^2$,

where $d_{j,k} = \rho_{j,1}/\rho_{k,1}$. Notice importantly that $|d_{j,k}| \le 1$ because k is the direction with maximal gradient-correlation at the start of the descent. It is also useful to keep in mind that $R_{j,k}$ is the sample correlation of $X_j$ and $X_k$ due to (1.3), and thus $|R_{j,k}| \le 1$. The following result states the number of steps taken along k before j becomes more favorable.

Theorem 2. The number of steps $m_{j,k}$ taken along k so that j becomes more favorable than k at $m_{j,k} + 1$ is the largest integer m such that

(2.2)    $(1 - \nu)^{m-1} \ge \frac{|d_{j,k} - R_{j,k}|}{1 - R_{j,k}\,\mathrm{sgn}(d_{j,k} - R_{j,k})}$.

It follows that for $0 < \nu < 1$

(2.3)    $m_{j,k} = \mathrm{floor}\left[1 + \frac{\log|d_{j,k} - R_{j,k}| - \log(1 - R_{j,k}\,\mathrm{sgn}(d_{j,k} - R_{j,k}))}{\log(1 - \nu)}\right]$,

where floor(z) is the largest integer less than or equal to z.
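A small sketch of (2.3) may make the result concrete. It evaluates the closed form and compares it with a brute-force count based on the expressions $\rho_{k,m+1} = (1 - \nu)^m \rho_{k,1}$ and $\rho_{j,m+1} = \rho_{j,1} - \nu_m \rho_{k,1} R_{j,k}$, both divided by $\rho_{k,1}$. The function names are assumptions, and the code assumes $|R_{j,k}| < 1$ so that the logarithm is defined.

```python
import math


def sgn(z):
    return (z > 0) - (z < 0)


def steps_to_favorability(d, R, nu):
    """m_{j,k} from (2.3); d = rho_{j,1} / rho_{k,1}, R = correlation of X_j and X_k."""
    c = d - R
    if c == 0:
        return math.inf                     # d = R: j never becomes more favorable
    q = math.log(abs(c)) - math.log(1.0 - R * sgn(c))
    return math.floor(1.0 + q / math.log(1.0 - nu))


def steps_by_iteration(d, R, nu, max_steps=100000):
    """Brute force: smallest m with |rho_{k,m+1}| < |rho_{j,m+1}|."""
    for m in range(1, max_steps + 1):
        shrink = (1.0 - nu) ** m            # rho_{k,m+1} / rho_{k,1}
        if shrink < abs(d - (1.0 - shrink) * R):
            return m
    return math.inf


print(steps_to_favorability(0.8, 0.5, 0.1))   # 5
print(steps_by_iteration(0.8, 0.5, 0.1))      # 5
print(steps_to_favorability(0.5, 0.5, 0.1))   # inf
```

In the first call, j becomes more favorable after five steps along k; the last call has $d_{j,k} = R_{j,k}$ and returns infinity.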

Remark 1. In particular, notice that $m_{j,k} = \infty$ when $d_{j,k} = R_{j,k}$ [adopting the standard convention that sgn(0) = 0 and assuming that $\nu < 1$]. We call $d_{j,k} = R_{j,k}$ the repressible condition. Section 4 will show that repressibility plays a key role in L2Boosting's behavior in correlated settings.

Remark 2. When $\nu = 1$ we obtain $m_{j,k} = 1$ from (2.2) which corresponds to greedy stagewise fitting. Because this makes the $\nu = 1$ case uninteresting, we shall hereafter assume that $0 < \nu < 1$.

2.2. Defining the solution path. Theorem 2 immediately shows that the problem of determining the next variable to enter the solution path can be recast as finding the direction requiring the fewest number of steps $m_{j,k}$ to favorability. When combined with Theorem 1, this characterizes the entire descent and can be used to characterize L2Boosting's solution path.






As before, assume that k corresponds to the first critical direction of the path, that is, $l_1 = k$. By Theorem 2, L2Boosting descends along k for a total of $S_1 = L_1$ steps, where $L_1 = m_{l_2,k}$ and $l_2$ is the coordinate requiring the smallest number of steps to become more favorable than k. By Theorem 1, the predictor at step $S_1$ is

    $F_{S_1} = F_0 + \nu_{L_1} \rho_{l_1}^{(1)} X_{l_1}$, where $\rho_{l_1}^{(1)} = X_{l_1}^T(y - F_0)$.

Applying Theorem 1 once again, but now using a descent along $l_2$ initialized at $F_{S_1}$, and continuing this argument recursively, as well as using the representation for the number of steps from Theorem 2, yields Theorem 3, which presents a recursive description of L2Boosting's solution path.

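Theorem 3. $F_{S_r} = F_{S_{r-1}} + \nu_{L_r} \rho_{l_r}^{(r)} X_{l_r}$, where $\{(l_r, L_r, S_r, \rho^{(r)})\}_1^{M^*}$ are determined recursively from

    $l_1 = \arg\max_j |X_j^T(y - F_0)|$,  $l_{r+1} = \arg\max_j |\rho_j^{(r+1)}|$,

    $L_r = M_{l_{r+1}}^{(r)}$,  $S_r = S_{r-1} + L_r$,

    $M_j^{(r)} = \mathrm{floor}\left[1 + \frac{\log|D_j^{(r)} - R_{j,l_r}| - \log(1 - R_{j,l_r}\,\mathrm{sgn}(D_j^{(r)} - R_{j,l_r}))}{\log(1 - \nu)}\right]$,

    $D_j^{(r)} = \frac{\rho_j^{(r)}}{\rho_{l_r}^{(r)}}$,  $\rho_j^{(r+1)} = \rho_j^{(r)} - \nu_{L_r} \rho_{l_r}^{(r)} R_{j,l_r}$.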

Remark 3. A technical issue arises in Theorem 3 when the minimizing $M_j^{(r)}$ is not unique. Non-uniqueness can occur due to rounding which is caused by the floor function used in the definition of $m_{j,k}$. This is why line 1 selects the next critical value, $l_{r+1}$, by maximizing the absolute gradient-correlation $|\rho_j^{(r+1)}|$ and not by minimizing the step number $M_j^{(r)}$. This definition for $l_{r+1}$ is equivalent to the two-step solution

    $l_{r+1} = \arg\max_{j \in \mathcal{M}_{r+1}} |\rho_j^{(r+1)}|$, where $\mathcal{M}_{r+1} = \arg\min_j \{M_j^{(r)}\}$.

Remark 4. Another technical issue arises when there is a tie in the absolute gradient-correlation. In line 3 of Algorithm 1 it may be possible for two coordinates, say j and k, to have equal gradient-correlations at step m > 1. Theorem 3 implicitly deals with such ties due to Definition 2. For example, suppose that the first m - 1 steps are along k with the tie occurring at step m. In the language of Theorem 2, because j becomes more favorable than k at m + 1, where $m = m_{j,k}$, we have

    $|\rho_{j,m-1}| < |\rho_{k,m-1}|$,  $|\rho_{j,m}| = |\rho_{k,m}|$,  $|\rho_{j,m+1}| > |\rho_{k,m+1}|$.

In this example, Theorem 3 resolves the tie at m by continuing to descend along k, then switching to j at step m + 1. Although Algorithm 1 does not explicitly address this issue, the potential discrepancy is minor because such ties should rarely occur in practice. This is because for $|\rho_{j,m}| = |\rho_{k,m}|$ to hold, the value inside the floor function of (2.3) used to define $m_{j,k}$ must be an integer (a careful analysis of the proof of Theorem 2 shows why). A tie can occur only when this value is an integer which is numerically unlikely to occur.

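Remark 5. Theorem 3 immediately yields a recursive solution for the coefficient vector, $\beta$. The solution path for $\beta$ is the piecewise solution

    $\beta_{S_r} = \beta_{S_{r-1}} + \nu_{L_r} \rho_{l_r}^{(r)} \mathbf{1}_{l_r}$,

where $\mathbf{1}_{l_r} \in \mathbb{R}^p$ is the vector with one in coordinate $l_r$ and zero elsewhere.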

2.3. Illustration: Diabetes data. Aside from the technical issue of ties, Theorem 3 and Algorithm 1 are equivalent. For convenience, we state Theorem 3 in an algorithmic form to facilitate comparison with Algorithm 1; see Algorithm 3. Computationally, Algorithm 3 improves upon Algorithm 1 by avoiding taking many small steps along a given descent. However, the difference is not substantial because the benefits only apply when $\nu$ is small, and as we will show later (Section 3), this forces the algorithm to cycle between its variables following the first descent, thus mitigating its ability to take large steps. Thus, strictly speaking, the benefit of Algorithm 3 is confined primarily to the first descent.

To investigate the differences between the two algorithms we analyzed the diabetes data used in Efron et al. (2004). The data consist of n = 442 patients in which the response of interest, y, is a quantitative measure of disease progression for a patient. In total there are 64 variables, which include 10 baseline measurements for each patient, 45 interactions and 9 quadratic terms.

Algorithm 3 L2Boosting (Solution path)
1: $F_0 = 0$; $S_0 = 0$; $l_1 = \arg\max_{1 \le j \le p} |X_j^T y|$
2: for r = 1 to M* do
3:   $l_{r+1} = \arg\max_j |\rho_j^{(r+1)}|$
4:   $L_r = M_{l_{r+1}}^{(r)}$; $S_r = S_{r-1} + L_r$
5:   $F_{S_r} = F_{S_{r-1}} + \nu_{L_r} \rho_{l_r}^{(r)} X_{l_r}$
6: end for

Fig. 2. L2Boosting applied to the diabetes data. First two panels display standardized gradient-correlation against step number m for Algorithms 3 and 1, respectively. Only coordinates in the solution path are displayed (a total of four). The third panel superimposes the first two panels. All analyses used $\nu = 0.005$.

In order to compare results, we translated each iteration, r, used by Algorithm 3 into its corresponding number of steps, m. Thus, while we ran Algorithm 3 for M* = 250 iterations, this translated into M = 332 steps. As expected, this difference is primarily due to the first iteration r = 1 which took m = 14 steps along the first critical direction (first panel of Figure 2; the rug indicates critical points, $S_r$). There are other instances where Algorithm 3 took more than one step (corresponding to the light grey tick marks on the rug), but these were generally steps of length 2. The standardized gradient-correlation is plotted along the y-axis of the figure. The standardized gradient-correlation for step m was defined as (using the notation of Algorithm 1)



* 



(2.4) p: 



The middle panel displays the results using Algorithm 1 with M = 250 steps. 
Clearly, the greatest gains from Algorithm 3 occur along the r = 1 descent. 
One can see this most clearly from the last panel which superimposes the 
first two panels. 

Remark 6. Note a potential computational optimization exists in Algorithm 3. It is possible to calculate the correlation values only once as each new variable enters the active set, then cache these values for future calculations. Thus, when $l_{r+1}$ is a new variable in the active set, we calculate $\{R_{j,l_{r+1}}\}_{j=1}^p$. The updated gradient-correlation is calculated efficiently by using addition and scalar multiplication using the previous gradient-correlation and the cached correlation coefficients

    $\rho_j^{(r+1)} = \rho_j^{(r)} - \nu_{L_r} \rho_{l_r}^{(r)} R_{j,l_r}$.

This is in contrast to Algorithm 1 which requires a vector multiplication of dimension p at each step m to update the gradient-correlation: $\rho_m = X_{k_m}^T g_m$.
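The following sketch shows one way the solution-path recursion and the caching idea of this remark might be combined. It jumps from critical point to critical point, using the closed form (2.3) for the descent length and the cached correlations for the gradient-correlation update. The function names, the dictionary cache and the toy data are assumptions for illustration; the code also assumes at least two columns, no perfectly collinear columns, and nonzero gradient-correlations.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgn(z):
    return (z > 0) - (z < 0)


def steps_to_favorability(d, R, nu):
    c = d - R
    if c == 0:
        return math.inf                      # repressible condition
    q = math.log(abs(c)) - math.log(1.0 - R * sgn(c))
    return math.floor(1.0 + q / math.log(1.0 - nu))


def solution_path(X_cols, y, nu, n_iter):
    """Return [(l_r, L_r, S_r)] for up to n_iter critical directions (0-based l_r)."""
    p = len(X_cols)
    rho = [dot(col, y) for col in X_cols]    # gradient-correlations at F_0 = 0
    cache = {}                               # l -> [R_{j,l} for all j], computed once
    path, S = [], 0
    l = max(range(p), key=lambda j: abs(rho[j]))
    for _ in range(n_iter):
        if l not in cache:
            cache[l] = [dot(col, X_cols[l]) for col in X_cols]
        R = cache[l]
        L = min(steps_to_favorability(rho[j] / rho[l], R[j], nu)
                for j in range(p) if j != l)
        if L == math.inf:
            break                            # no other direction ever becomes favorable
        nu_L = 1.0 - (1.0 - nu) ** L         # one large step replaces L small ones
        rho_l = rho[l]
        rho = [rho[j] - nu_L * rho_l * R[j] for j in range(p)]  # cached update
        S += L
        path.append((l, L, S))
        l = max(range(p), key=lambda j: abs(rho[j]))
    return path


s = 2 ** 0.5
X_cols = [[1 / s, -1 / s, 0.0], [1 / s, 0.0, -1 / s]]   # hypothetical toy design
y = [3 / s, -2 / s, -1 / s]
path = solution_path(X_cols, y, nu=0.1, n_iter=4)
print(path)  # [(0, 5, 5), (1, 1, 6), (0, 1, 7), (1, 1, 8)]
```

On this toy design the first descent takes five steps and every later descent takes a single step.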

Remark 7. Above, when we refer to the "active set," we mean the 
unique set of critical directions in the current solution path. This term will 
be used repeatedly throughout the paper. 
Fig. 3. Distance $m_{j,k}$ of each variable j to favorability relative to the current descent k, plotted against step m (results based on Algorithm 1 where $\nu = 0.005$). For visual clarity the $m_{j,k}$ values have been smoothed using a running median smoother.

2.4. Visualizing the solution path. Throughout the paper we illustrate different ways of utilizing $m_{j,k}$ of Theorem 2 to explore L2Boosting. So far we have confined the use of Theorem 2 to determining the descent length along a fixed direction, but another interesting application is determining how far a given variable is from the active set. Note that although Theorem 2 was described in terms of an active set of only one coordinate, it applies in general, regardless of the size of the active set. Thus, $m_{j,k}$ can be calculated at any step m to determine the number of steps required for j to become more favorable than the current direction, k. This value represents the distance of j to the solution path and can be used to visualize it.

To demonstrate this, we applied Algorithm 1 to the diabetes data for M = 10,000 steps and recorded $m_{j,k}$ for each of the p = 64 variables. Figure 3 records these values. Each "jagged path" in the figure is the trace over the 10,000 steps for a variable j. Each point on the path equals the number of steps $m_{j,k}$ to favorability relative to the current descent k. The patterns are quite interesting. The top variables have $m_{j,k}$ values which quickly drop within the first 1000 steps. Another group of variables have values which take much longer to drop, doing so somewhere between 2000 and 4000 steps, but then increase almost immediately. These variables enter the solution path but then quickly become unattractive regardless of the descent direction.

It has become popular to visualize the solution path of forward stagewise algorithms by plotting their gradient-correlation paths and/or their coefficient paths. Figure 3 is a similar tool. A unique feature of $m_{j,k}$ is that it depends not only on the gradient-correlation (via $d_{j,k}$), but also the correlation in the x-variables (via $R_{j,k}$) and the learning parameter $\nu$. In this manner, Figure 3 offers a new tool for understanding and exploring such algorithms.
3. Cycling behavior. It has been widely observed that decreasing the regularization parameter slows the convergence of stagewise descent algorithms. Efron et al. (2004) showed that the $FS_\epsilon$ algorithm tracks the equiangular direction of the LAR path for arbitrarily small $\epsilon$. To achieve what LAR does in a single step, the $FS_\epsilon$ algorithm may require thousands of small steps in a direction tightly clustered around the equiangular vector, eventually ending up at nearly the same point as LAR.

We show that L2Boosting exhibits this same phenomenon. We do so by describing this property as an active set cycling phenomenon. Using results from the earlier fixed descent analysis, we show in the case of an active set of two variables that L2Boosting systematically switches (cycles) between its two variables when $\nu$ is small. For an arbitrarily small $\nu$ this forces the absolute gradient-correlations for the active set variables to be nearly equal. This point of equality represents a singularity point that triggers a near-perpetual deterministic cycle between the variables, ending only when a new variable enters the active set with nearly the same absolute gradient-correlation.

3.1. L2Boosting's gradient equality point. Our insight will come from looking at Theorem 2 in more depth. As before, assume the algorithm has been initialized so that k is the first critical direction. Previously the descent along k was described in terms of steps, but this can be equivalently expressed in units of the "step size" taken. Define

    $\nu_{j,k} = \nu_{m_{j,k}} = 1 - (1 - \nu)^{m_{j,k}}$.

Recall that Theorem 1 showed that a single step along k with $\nu$ replaced with $\nu_{j,k}$ yields the same limit as $m_{j,k}$ steps along k using $\nu$. We call $\nu_{j,k}$ the step size taken along k. Because j becomes more favorable than k at $m_{j,k} + 1$, the gradient following a step size of $\nu_{j,k}$ along k satisfies

(3.1)    $|X_k^T(y - F_0 - \nu_{j,k} \rho_{k,1} X_k)| < |X_j^T(y - F_0 - \nu_{j,k} \rho_{k,1} X_k)|$.

This applies to all coordinates $j \ne k$, and in particular holds for the second critical direction, $l_2$, which rephrased in terms of step size, is the smallest $\nu_{j,k}$ value,

    $l_2 = \arg\min_{j \ne k} \{\nu_{j,k}\}$.

Although inequality (3.1) is strict, it becomes arbitrarily close to equality with shrinking $\nu$. With a little bit of rearranging, (2.2) implies that

(3.2)    $\hat\nu_j < \nu_{j,k}$, where $\hat\nu_j = 1 - \frac{|d_{j,k} - R_{j,k}|}{1 - R_{j,k}\,\mathrm{sgn}(d_{j,k} - R_{j,k})}$.

We will show $\hat\nu_j$ is the step size making the absolute gradient-correlation between j and k equal

(3.3)    $|X_j^T(y - F_0 - \hat\nu_j \rho_{k,1} X_k)| = |X_k^T(y - F_0 - \hat\nu_j \rho_{k,1} X_k)|$.
The next theorem shows that $\nu_{l_2,k}$ converges to the smallest $\hat\nu_j$ satisfying (3.3); thus, (3.1) becomes an equality in the limit. For convenience, we define $\nu_{j,k}^- = \nu_{m_{j,k}-1}$.

Theorem 4. Let $\hat\rho_j = X_j^T(y - F_0 - \hat\nu_j \rho_{k,1} X_k)$. Then $|\hat\rho_j| = |\hat\rho_k|$. Furthermore, if $l^* = \arg\min_{j \ne k}\{\hat\nu_j\}$ and $\hat\nu = \hat\nu_{l^*}$, then $\nu_{l_2,k}^- \le \hat\nu < \nu_{l_2,k}$ and $\nu_{l_2,k} \to \hat\nu$ as $\nu \to 0$.

Therefore, for arbitrarily small $\nu$, $\nu_{l_2,k} \approx \hat\nu$ and k and $l_2$ will have near-equal absolute gradient-correlations. This latter property triggers two-cycling. To see why, let us assume for the moment that the active set variables have equal absolute gradient-correlations. Then by a direct application of Theorem 2, one can show that the number of steps taken along $l_2$ before k becomes more favorable is m = 1. Thus, following the descent along k, the algorithm switches to $l_2$, but then immediately switches back to k. If $\nu$ is small enough, this process is repeated, setting off a two-cycling pattern.
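These statements can be checked numerically in the two-variable case. The sketch below evaluates the step size $1 - |d - R| / (1 - R\,\mathrm{sgn}(d - R))$ of (3.2), confirms that it equalizes the two absolute gradient-correlations (written relative to $|\rho_{k,1}|$), and shows the step size actually taken along k approaching it as $\nu$ shrinks. The function names and the values d = 0.8, R = 0.5 are assumptions chosen for illustration.

```python
import math


def sgn(z):
    return (z > 0) - (z < 0)


def equality_step_size(d, R):
    """Step size from (3.2) at which |rho_j| and |rho_k| coincide."""
    c = d - R
    return 1.0 - abs(c) / (1.0 - R * sgn(c))


def steps_to_favorability(d, R, nu):
    """m_{j,k} from (2.3), assuming d != R and |R| < 1."""
    c = d - R
    q = math.log(abs(c)) - math.log(1.0 - R * sgn(c))
    return math.floor(1.0 + q / math.log(1.0 - nu))


d, R = 0.8, 0.5
nu_hat = equality_step_size(d, R)
lhs = abs(d - nu_hat * R)      # |rho_j| / |rho_{k,1}| after a step of size nu_hat along k
rhs = 1.0 - nu_hat             # |rho_k| / |rho_{k,1}| after the same step
print(lhs, rhs)                # equal up to rounding

for nu in (0.1, 0.01, 0.001):
    m = steps_to_favorability(d, R, nu)
    nu_minus = 1.0 - (1.0 - nu) ** (m - 1)   # step size one step before the switch
    nu_jk = 1.0 - (1.0 - nu) ** m            # step size actually taken along k
    print(nu, m, nu_minus, nu_hat, nu_jk)    # nu_minus <= nu_hat < nu_jk

# With equal absolute gradient-correlations (d = 1 or d = -1) a single step suffices.
print(steps_to_favorability(1.0, R, 0.01), steps_to_favorability(-1.0, R, 0.01))  # 1 1
```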

The next result is a formal statement of these arguments. Define

    $d_{i,k}^{(m)} = \frac{\rho_{i,m}}{\rho_{k,m}}$, where $\rho_{i,m} = X_i^T(y - F_{m-1})$, $1 \le i \le p$.

For notational convenience, let $j = l_2$ and $m = m_{j,k}$. For technical reasons we shall assume $d_{j,k}^{(m+1)} \ne R_{j,k}$. Recall Remark 1 showed that $d_{j,k}^{(m+1)} = R_{j,k}$, the repressible condition, yields an infinite number of steps to favorability. Thus, for k to be even eligible for favorability we must have $d_{j,k}^{(m+1)} \ne R_{j,k}$.

Theorem 5. If the first two critical directions are (k, j) and $\nu_{j,k} = \hat\nu_j$, then k is favored over j for the next step after j if $d_{j,k}^{(m+1)} \ne R_{j,k}$.

Theorem 5 assumes that $\nu_{j,k} = \hat\nu_j$. While this only holds in the limit, the two values should be nearly equal for arbitrarily small $\nu$ and thus the assumption is reasonable. Notice also that Theorem 5 only shows that k is more favorable than j, and not that the algorithm switches to k. However, we can see that this must be the case. For arbitrarily small $\nu$, k's gradient-correlation should be nearly equal to j's, and by definition, j has maximal absolute gradient-correlation along the second descent.

Indeed, the following result shows that the absolute gradient-correlations for k and j can be made arbitrarily close for small enough $\nu$ for any step $r \ge 1$ following the descent along k. The result also shows that the sign of the gradient-correlation is preserved when $\nu$ is arbitrarily small, a fact that we shall use later.

Theorem 6. $\rho_{j,m+r}/\rho_{k,m+r} \to \mathrm{sgn}(\hat\rho_j)/\mathrm{sgn}(\hat\rho_k)$ as $\nu \to 0$ for each $r \ge 1$.

Combining Theorems 5 and 6, we see that if $\nu$ is small enough, the first three critical directions of the path must be (k, j, k) with critical points (m, m + 1, m + 2). And once the descent switches back to k, it is clear from the same argument that the next critical direction, $l_4$, will be j, and so forth.

3.2. Illustration of two-cycling. We present a numerical example demonstrating two-cycling. For our example, we simulated data according to

    $y = X\beta + \varepsilon$,  $\varepsilon \sim N(0, I)$,

where n = 100, and p = 40. The first 10 coordinates of $\beta$ were set to 5, with the remaining coordinates set to 0. The design matrix X was simulated by drawing its entries independently from a standard normal distribution.
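A data set of this form could be generated along the following lines. The sketch draws X and the noise from standard normal distributions and then applies the standardization (1.3). The function names, the seed and the choice to standardize after simulation are assumptions; the source does not describe these details for this example.

```python
import random


def simulate(n=100, p=40, n_signal=10, coef=5.0, seed=1):
    """Draw X with i.i.d. N(0, 1) entries and y = X beta + N(0, 1) noise."""
    rng = random.Random(seed)                 # hypothetical seed, for reproducibility
    X_cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    beta = [coef] * n_signal + [0.0] * (p - n_signal)
    y = [sum(beta[k] * X_cols[k][i] for k in range(p)) + rng.gauss(0.0, 1.0)
         for i in range(n)]
    return X_cols, y


def standardize(X_cols, y):
    """Apply (1.3): center y; center each column and scale it to squared length one."""
    y_bar = sum(y) / len(y)
    y_std = [v - y_bar for v in y]
    cols = []
    for col in X_cols:
        mean = sum(col) / len(col)
        centered = [v - mean for v in col]
        length = sum(v * v for v in centered) ** 0.5
        cols.append([v / length for v in centered])
    return cols, y_std


X_cols, y = standardize(*simulate())
print(len(X_cols), len(y))  # 40 100
```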

Figure 4 plots the standardized gradient-correlations (2.4) from Algorithm 3 using $\nu = 0.01$. As done earlier, we have converted iterations r into step numbers m along the x-axis. The plots show the behavior of each coordinate within an active set descent. The rug marks show each step m for clarity, and dashed vertical lines indicate the step $m_{j,k}$ where the next step adds a new critical direction to the solution path. The top left panel shows the complete descent along the first three active variables. The remaining panels detail the coordinate behavior as the active set increases from one to three coordinates.



Fig. 4. Standardized gradient-correlation path for $\nu = 0.01$. Top left panel details the path through the first three active variables, the remaining panels detail each active variable descent.
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The top right panel shows repeated selection of the li direction shown in 
black. The last step along li occurs at mj^k marked with the vertical dashed 
line, where the next step is along the I2 direction shown in red. This point 
marks the beginning of the two-cycling behavior, which continues in the 
lower left panel. At each step, the algorithm systematically switches between 
the li and I2 directions, until an additional direction becomes more favorable. 
The cycling pattern is {h,l2,h,h, ■ ■ ■}• The lower right panel demonstrates 
three-cycling behavior. Here it is instructive to note that the order of selec- 
tion within three-cycling is nondeterministic. In this panel the order starts 
as {ls,l2,h, ■ ■ ■} , but changes near m = 70 to {. . . , ^3, /i, • • •}• As discussed 
later, nondeterministic cycling patterns are typical behavior of higher order 
cycling (active sets of size greater than two). 
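The selection behavior described above can be explored with a minimal
sketch of componentwise L2Boosting. This is not Algorithm 3 itself (which
is not reproduced in this section); it is the plain step-by-step procedure,
assuming $F_0 = 0$, columns standardized to unit norm, and a centered
response. The function names `standardize` and `l2boost_path` and the seed
are hypothetical choices made only for illustration.

```python
import math
import random

def standardize(cols):
    """Center each column and scale it to unit Euclidean norm."""
    out = []
    for col in cols:
        mean = sum(col) / len(col)
        centered = [v - mean for v in col]
        norm = math.sqrt(sum(v * v for v in centered))
        out.append([v / norm for v in centered])
    return out

def l2boost_path(cols, y, nu, steps):
    """Componentwise L2Boosting with F_0 = 0 (illustrative sketch).

    Returns the coordinate selected at each step and the residual sum of
    squares after each step.
    """
    resid = list(y)
    selected, rss = [], []
    for _ in range(steps):
        # gradient-correlations rho_l = X_l^T (y - F)
        rho = [sum(x * r for x, r in zip(col, resid)) for col in cols]
        best = max(range(len(cols)), key=lambda j: abs(rho[j]))
        # take a step of size nu along the most correlated coordinate
        resid = [r - nu * rho[best] * x for r, x in zip(resid, cols[best])]
        selected.append(best)
        rss.append(sum(r * r for r in resid))
    return selected, rss

# Simulation in the spirit of the text: n = 100, p = 40, first 10 betas = 5.
rng = random.Random(1)  # arbitrary seed (assumption)
n, p = 100, 40
raw = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
beta = [5.0] * 10 + [0.0] * (p - 10)
y = [sum(beta[j] * raw[j][i] for j in range(p)) + rng.gauss(0, 1)
     for i in range(n)]
y_mean = sum(y) / n
y = [v - y_mean for v in y]
cols = standardize(raw)

selected, rss = l2boost_path(cols, y, nu=0.01, steps=300)
print(selected[:80])  # inspect the run of repeated and alternating indices
```

Printing the selected indices makes the fixed descent (one index repeated)
and the subsequent switching between active coordinates visible; the exact
sequence depends on the simulated data.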

3.3. The limiting path. Here we provide a formal limiting result of
two-cycling. The result can be viewed as the analog of Theorem 4 when the
active set involves two variables. Using a slightly modified version of
L2Boosting we show that for arbitrarily small $\nu$, if the algorithm
cycles between its two active variables, it does so until a new variable
enters the active set with the same absolute gradient-correlation.

Assume the active set is $A = \{k, j\}$ and that k and j are cycling
according to $\{k, j, k, j, \ldots\}$. The m-step predictor for
m = 1, ..., M is

(3.4)
$$F_m = \begin{cases}
F_{m-1} + \nu\rho_{k,m}X_k, & \text{if $m$ is odd},\\
F_{m-1} + \nu\rho_{j,m}X_j, & \text{if $m$ is even},
\end{cases}$$

where $\rho_{l,m} = X_l^T(y - F_{m-1})$. The cycling pattern (3.4) is
assumed to persist for a minimum length of $M \ge 3$.

It will simplify matters if the cycling is assumed to be initialized with
strict equality of the gradient correlations: $|\rho_{k,1}| = |\rho_{j,1}|$.
With an arbitrarily small $\nu$, this will force near equal absolute
gradient-correlations at each step and by Theorem 6 will preserve the sign
of the gradient-correlation. We assume

$$\frac{\rho_{j,m}}{\rho_{k,m}} = \frac{\mathrm{sgn}(\rho_{j,1})}{\mathrm{sgn}(\rho_{k,1})}
\qquad \text{for } m \ge 1.$$

It should be emphasized that the above assumptions represent a simplified
version of L2Boosting. In practice, we would have

$$\rho_{j,m} = s\rho_{k,m} + O(\nu),$$

where $s = \mathrm{sgn}(\rho_{j,1})/\mathrm{sgn}(\rho_{k,1})$. However, for
convenience we will not concern ourselves with this level of detail here.
Readers can consult Ehrlinger (2011) for a more refined analysis.

One way to ensure $|\rho_{k,1}| = |\rho_{j,1}|$ is to initialize the
algorithm with the limiting predictor $F_0 + \nu_j\rho_{k,1}X_k$ of
Theorem 4 obtained by letting $\nu \to 0$ along the k-descent. With a
slight abuse of notation denote this initial estimator by $F_0$. However,
the fact that this specific $F_0$ is used does not play a direct role in
the results. Under the above assumptions, the following closed form
expression for the m-step predictor under two-cycling holds.

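Theorem 7. Assume that $\rho_{j,m} = s\rho_{k,m}$ for $m \ge 1$. If
$d_{j,k} \ne R_{j,k}$, then for any $0 < \nu < 1/2$ satisfying
$1 + sR_{j,k} > \nu$, we have for each $m \ge 1$,

$$F_m = \begin{cases}
F_0 + V_{m+1}\rho_{k,1}\left[X_k + \dfrac{V_{m-1}}{V_{m+1}}(s - \nu R_{j,k})X_j\right], & \text{if $m$ is odd},\\[4pt]
F_0 + V_m\rho_{k,1}\left[X_k + (s - \nu R_{j,k})X_j\right], & \text{if $m$ is even},
\end{cases}$$

where $V_m = \nu\nu_A^{-1}[1 - (1 - \nu_A)^{m/2}]$ and
$\nu_A = \nu(1 + sR_{j,k} - \nu R_{j,k}^2)$. Note that $0 < \nu_A < 1$
under the asserted conditions.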
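As a sanity check on the closed form of Theorem 7, the sketch below
iterates the simplified two-cycling recursion directly and compares the
accumulated coefficients on $X_k$ and $X_j$ with
$V_m = \nu\nu_A^{-1}[1 - (1-\nu_A)^{m/2}]$ and
$\nu_A = \nu(1 + sR_{j,k} - \nu R_{j,k}^2)$. It works with scalars only
(unit-norm columns with correlation R are assumed), and the function names
are hypothetical.

```python
def two_cycle_iterate(rho_k1, s, R, nu, m):
    """Run m steps of the simplified two-cycling recursion.

    Returns (a, b), the coefficients on X_k and X_j in F_m - F_0.
    Odd steps move along k; even steps move along j. The simplifying
    assumption rho_j = s * rho_k is imposed at the start of each odd step.
    """
    a = b = 0.0
    rho_k = rho_k1
    rho_j = s * rho_k1
    for step in range(1, m + 1):
        if step % 2 == 1:
            a += nu * rho_k
            # gradient-correlation of j after the k-step
            rho_j = (s - nu * R) * rho_k
        else:
            b += nu * rho_j
            # gradient-correlation of k after the k-step and the j-step
            rho_k = (1 - nu) * rho_k - nu * R * rho_j
    return a, b

def two_cycle_closed_form(rho_k1, s, R, nu, m):
    """Coefficients on X_k and X_j from the closed form of Theorem 7."""
    nu_A = nu * (1 + s * R - nu * R * R)

    def V(q):
        return nu / nu_A * (1 - (1 - nu_A) ** (q / 2))

    if m % 2 == 1:
        return V(m + 1) * rho_k1, V(m - 1) * (s - nu * R) * rho_k1
    return V(m) * rho_k1, V(m) * (s - nu * R) * rho_k1

# Illustrative values (assumptions): 0 < nu < 1/2 and 1 + s*R > nu.
for m in (1, 2, 7, 20):
    print(m, two_cycle_iterate(2.0, 1, 0.6, 0.05, m),
          two_cycle_closed_form(2.0, 1, 0.6, 0.05, m))
```

The agreement reflects the fact that, under the stated assumption, the
gradient-correlation for k shrinks by the factor $(1 - \nu_A)$ over each
pair of steps, so the coefficients are partial sums of a geometric series.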

To determine the limit of the above predictor, we must first determine
when a new direction $l \notin A$ becomes more favorable. For l to be more
favorable at m + 1, we must have $|\rho_{j,m+1}| < |\rho_{l,m+1}|$ when m is
odd, or $|\rho_{k,m+1}| < |\rho_{l,m+1}|$ when m is even. The following
result determines the number of steps to favorability. For simplicity only
the case when m is odd is considered, but this does not affect the limiting
result.

Theorem 8. Assume the same conditions as Theorem 7. Then l becomes more
favorable than j at step m + 1 where m is the largest odd integer
$m \ge 3$ such that

(3.5)
$$(1 - \nu_A)^{(m-1)/2} > \frac{|d_{l,k} - R_{j,k,l}|}{1 - R_{j,k,l}\,\mathrm{sgn}(d_{l,k} - R_{j,k,l})},$$

where $d_{l,k} = \rho_{l,1}/\rho_{k,1}$ and

$$R_{j,k,l} = \frac{R_{l,k} + (s - \nu R_{j,k})R_{l,j}}{1 + sR_{j,k} - \nu R_{j,k}^2}.$$

Clearly (3.5) shares common features with (2.2). This is no coincidence.
The bounds are similar in nature because both are derived by seeking the
point where the absolute gradient-correlations between sets of variables
are equal. In the case of two-cycling, this is the singularity point where
k, j and l are all equivalent in terms of absolute gradient-correlation.
The following result states the limit of the predictor under two-cycling.

Theorem 9. Under the conditions of Theorem 7, the limit of $F_m$ as
$\nu \to 0$ at the next critical direction $l^*$ equals

$$F = F_0 + \hat\nu\rho_{k,1}[X_k + sX_j],$$

where $l^* = \arg\min_{l \notin A}\{\nu_l\}$, $\hat\nu = \nu_{l^*}$,

(3.6)
$$\nu_l = \left(1 - \frac{|d_{l,k} - R_{j,k,l}|}{1 - R_{j,k,l}\,\mathrm{sgn}(d_{l,k} - R_{j,k,l})}\right)(1 + sR_{j,k})^{-1}$$

and $R_{j,k,l} = (R_{l,k} + sR_{l,j})/(1 + sR_{j,k})$. Furthermore,
$|\rho_{l^*}| = |\rho_k| = |\rho_j|$, where for each l,
$\rho_l = X_l^T(y - F)$.

This shows that the predictor moves along the combined direction
$X_k + sX_j$ taking a step size $\hat\nu$ that makes the absolute
gradient-correlation for $l^*$ equal to that of the active set
$A = \{k, j\}$. Theorem 9 is a direct analog of Theorem 4 to two-cycling.

Not surprisingly, one can easily show that this limit coincides with the
LAR solution. To show this, we rewrite F in a form comparable to LAR,

$$F = F_0 + \hat\nu|\rho_{k,1}|\,[\mathrm{sgn}(\rho_{k,1})X_k + \mathrm{sgn}(\rho_{j,1})X_j].$$

Recall that LAR moves the shortest distance along the equiangular vector
defined by the current active set until a new variable with equal absolute
gradient-correlation is reached. The term in square brackets above is
proportional to this equiangular vector. Thus, since F is obtained by
moving the shortest distance along the equiangular vector such that
$\{j, k, l^*\}$ have equal absolute gradient-correlation, F must be
identical to the LAR solution.
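The equal-correlation property stated in Theorem 9 can be verified on a
small example. The sketch below evaluates the step size of (3.6) for a
single candidate direction l and checks that, at the limiting predictor,
the three absolute gradient-correlations coincide. The three unit-norm
columns and the response are made-up values chosen so that
$|\rho_{k,1}| = |\rho_{j,1}|$; $F_0 = 0$ is assumed, and the function names
are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgn(x):
    return (x > 0) - (x < 0)

def two_cycling_limit(X_k, X_j, X_l, y):
    """Step size nu_l of (3.6) and the limiting predictor F (F_0 = 0).

    Assumes unit-norm columns and |rho_k| = |rho_j| at the start.
    """
    rho_k, rho_j, rho_l = dot(X_k, y), dot(X_j, y), dot(X_l, y)
    s = sgn(rho_j) * sgn(rho_k)  # same value as sgn(rho_j)/sgn(rho_k)
    R_jk, R_lk, R_lj = dot(X_j, X_k), dot(X_l, X_k), dot(X_l, X_j)
    d = rho_l / rho_k
    R_jkl = (R_lk + s * R_lj) / (1 + s * R_jk)
    ratio = abs(d - R_jkl) / (1 - R_jkl * sgn(d - R_jkl))
    nu_l = (1 - ratio) / (1 + s * R_jk)
    # move along the combined direction X_k + s X_j
    F = [nu_l * rho_k * (a + s * b) for a, b in zip(X_k, X_j)]
    return nu_l, F

X_k = [1.0, 0.0, 0.0]
X_j = [0.6, 0.8, 0.0]
X_l = [0.0, 0.6, 0.8]
y = [2.0, 1.0, 0.5]  # gives rho_k = rho_j = 2 and rho_l = 1

nu_hat, F = two_cycling_limit(X_k, X_j, X_l, y)
resid = [a - b for a, b in zip(y, F)]
rho_at_limit = [abs(dot(X, resid)) for X in (X_k, X_j, X_l)]
print(nu_hat, rho_at_limit)
```

With only one candidate outside the active set, $l^*$ is that candidate;
with several, the same function would be evaluated for each l and the
smallest $\nu_l$ taken.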

3.4. General cycling. Analysis of cycling in the general case where the
active set $A = \{k_i\}_{i=1}^d$ is comprised of d > 2 variables is more
complex. In two-cycling we observed cycling patterns of the form
$\{l_1, l_2, l_1, l_2, \ldots\}$, but when d > 2, L2Boosting's cycling
patterns are often observed to be nondeterministic with no discernible
pattern in the order of selected critical directions. Moreover, one often
observes some coordinates being selected more frequently than others.

A study of d-cycling has been given by Ehrlinger (2011). However, the
analysis assumes deterministic cycling of the form

$$\{l_1, l_2, \ldots, l_d, l_{d+1}, \ldots\} = \{k_1, k_2, \ldots, k_d, k_1, \ldots\},$$

which is the natural extension of the two-cycling just studied. To
accommodate this framework, a modified L2Boosting procedure involving
coordinate-dependent step sizes was used. This models L2Boosting's cycling
tendency of selecting some coordinates more frequently by using the size of
a step to dictate the relative frequency of selection. Under constraints on
the coordinate step sizes, equivalent to solving a system of linear
equations defining the equiangular vector used by LAR, it was shown that
the modified L2Boosting procedure yields the LAR solution in the limit.
Interested readers should consult Ehrlinger (2011) for details.

4. Repressibility affects variable selection in correlated settings. Now
we turn our attention to the issue of correlation. We have shown that
regardless of the size of the active set a new direction j becomes more
favorable than the current direction k at step $m_{j,k} + 1$ where
$m_{j,k}$ is the smallest integer value satisfying

(4.1)
$$1 - \frac{|d_{j,k} - R_{j,k}|}{1 - R_{j,k}\,\mathrm{sgn}(d_{j,k} - R_{j,k})} < 1 - (1 - \nu)^{m_{j,k}}.$$


Using our previous notation, let $\nu_{j,k}$ denote the right-hand side of
the above inequality.

Generally, large values of $m_{j,k}$ are designed to hinder noninformative
variables from entering the solution path. If j requires a large number of
steps to become favorable, it is noninformative relative to the current
gradient and therefore unattractive as a candidate. Surprisingly, however,
such an interpretation does not always apply in correlated problems. There
are situations where j is informative, but $m_{j,k}$ can be artificially
large due to correlation.

To see why, suppose that j is an informative variable with a relatively
large value of $d_{j,k}$. Now, if j and k are correlated, so much so that
$R_{j,k} \approx d_{j,k}$, then $|d_{j,k} - R_{j,k}| \approx 0$. Hence,
$m_{j,k} \approx \infty$ and $\nu_{j,k} \approx 1$ due to (4.1). Thus, even
though j is promising with a large gradient-correlation, it is unlikely to
be selected because of its high correlation with k.
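The inflation of the number of steps as $d_{j,k}$ approaches $R_{j,k}$ can
be seen numerically. The sketch below solves the inequality (4.1) for the
smallest integer, which gives
floor[1 + log(ratio)/log(1 - $\nu$)] with
ratio = $|d_{j,k} - R_{j,k}|/(1 - R_{j,k}\,\mathrm{sgn}(d_{j,k} - R_{j,k}))$,
and compares it with a brute-force fixed descent along k. It assumes
$|d_{j,k}| < 1$ and $|R_{j,k}| < 1$, normalizes $\rho_k = 1$, and uses
hypothetical function names.

```python
import math

def sgn(x):
    return (x > 0) - (x < 0)

def steps_to_favorability(d, R, nu):
    """Smallest integer m with (1 - nu)^m < ratio, i.e. the solution of (4.1)."""
    ratio = abs(d - R) / (1 - R * sgn(d - R))
    if ratio == 0:
        return math.inf  # repressible condition d = R: j never overtakes k
    return math.floor(1 + math.log(ratio) / math.log(1 - nu))

def brute_force_steps(d, R, nu, max_steps=100000):
    """Descend along k until |rho_j| first exceeds |rho_k|."""
    rho_k, rho_j = 1.0, d
    for m in range(1, max_steps + 1):
        rho_j -= nu * R * rho_k   # effect of a k-step on j's correlation
        rho_k *= (1 - nu)         # effect of a k-step on k's correlation
        if abs(rho_j) > abs(rho_k):
            return m
    return math.inf

nu = 0.05
for R in (0.3, 0.79, 0.7999):
    print(R, steps_to_favorability(0.8, R, nu), brute_force_steps(0.8, R, nu))
```

Holding $d_{j,k} = 0.8$ fixed, the step count grows as $R_{j,k}$ moves
toward 0.8, even though the gradient-correlation of j itself is unchanged.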

The problem is that j becomes an unlikely candidate for selection when
$d_{j,k}$ is close to $R_{j,k}$. In fact, $m_{j,k} = \infty$ when
$d_{j,k} = R_{j,k}$ so that j can never become more favorable than k when
the two values are equal. We have already discussed the condition
$d_{j,k} = R_{j,k}$ several times now, and have referred to it as the
repressible condition. Repressibility plays an important role in correlated
settings. We distinguish between two types of repressibility: weak and
strong repressibility. Weak repressibility occurs in the trivial case when
$|R_{j,k}| = 1$. Weak repressibility implies that
$|d_{j,k}| = |R_{j,k}| = 1$. Hence the gradient-correlations for j and k
are equal in absolute value and j and k are perfectly correlated. This
trivial case simply reflects a numerical issue arising from the redundancy
of the j and k columns of the X design matrix. The stronger notion of
repressibility, which we refer to as strong repressibility, is required to
address the nontrivial case $|R_{j,k}| \ne 1$ in which j is repressed
without being perfectly correlated with k. The following definition
summarizes these ideas.

Definition 3. We say j has the strong repressible condition if
$d_{j,k} = R_{j,k}$ and $|R_{j,k}| < 1$. We say that j is (strongly)
repressed by k when this happens. On the other hand, j has the weak
repressible condition if j and k are perfectly correlated
($|R_{j,k}| = 1$) and $d_{j,k} = R_{j,k}$.

4.1. An illustrative example. We present a numerical example of how 
repressibility can hinder variables from being selected. For our illustration 
we use example (d) of Section 5 from Zou and Hastie (2005). The data was
simulated according to

(4.2)
$$y = X\beta + \sigma\varepsilon, \qquad \varepsilon \sim N(0, I),$$

where n = 100, p = 40 and $\sigma = 15$. The first 15 coordinates of
$\beta$ were set to 3; all other coordinates were 0. The design matrix
$X = [X_1, \ldots, X_{40}]_{100 \times 40}$ was simulated according to





$$\begin{aligned}
X_j &= Z_1 + \tau\varepsilon_j, & j &= 1, \ldots, 5,\\
X_j &= Z_2 + \tau\varepsilon_j, & j &= 6, \ldots, 10,\\
X_j &= Z_3 + \tau\varepsilon_j, & j &= 11, \ldots, 15,\\
X_j &= \varepsilon_j, & j &> 15,
\end{aligned}$$

where $(Z_j)_1^3$ and $(\varepsilon_j)_1^{40}$ were i.i.d. N(0, I) and
$\tau = 0.1$. In this simulation, only coordinates 1 to 5, 6 to 10 and 11
to 15 have nonzero coefficients. These x-variables are uncorrelated across
groups, but share the same correlation within a group. Because the within
group correlation is high, but less than 1, the simulation is ideal for
exploring the effects of strong repressibility.
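A minimal sketch of this simulation design is given below, assuming the
grouped structure just described (three latent vectors $Z_1, Z_2, Z_3$,
$\tau = 0.1$, $\sigma = 15$, and the first 15 coefficients equal to 3). The
function names and the seed are hypothetical, and no standardization is
applied here. The sample correlations confirm the high within-group and low
across-group correlation.

```python
import math
import random

def simulate_example_d(n=100, tau=0.1, sigma=15.0, seed=0):
    """Simulate the grouped design and response; returns (columns, y)."""
    rng = random.Random(seed)
    Z = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
    cols = []
    for j in range(40):
        eps = [rng.gauss(0, 1) for _ in range(n)]
        if j < 15:
            z = Z[j // 5]  # coordinates 1-5, 6-10, 11-15 share Z_1, Z_2, Z_3
            cols.append([zi + tau * e for zi, e in zip(z, eps)])
        else:
            cols.append(eps)  # pure noise variables
    beta = [3.0] * 15 + [0.0] * 25
    y = [sum(beta[j] * cols[j][i] for j in range(40)) + sigma * rng.gauss(0, 1)
         for i in range(n)]
    return cols, y

def corr(u, v):
    """Sample correlation between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) *
                    sum((b - mv) ** 2 for b in v))
    return num / den

cols, y = simulate_example_d()
print("within group :", corr(cols[0], cols[1]))
print("across groups:", corr(cols[0], cols[5]))
print("signal/noise :", corr(cols[0], cols[20]))
```

The population within-group correlation is $1/(1 + \tau^2)$, which is
about 0.99 for $\tau = 0.1$.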

Figure 5 displays results from fitting Algorithm 3 for M* = 500 iterations
with $\nu = 0.05$. The first 5 panels are the values of $\nu_{j,k}$ against
the iteration r = 1, ..., 500, with points colored in red indicating
iterations r where $k \in \{1, \ldots, 5\}$ and k is used generically to
denote the current descent direction. Notationally, the descent at
iteration r is along k for a step size of $\nu_{l,k}$, at which point l
becomes more favorable than k and the descent switches to l, the next
critical direction. The value plotted, $\nu_{j,k}$, is the step size for
j = 1, ..., 5.

Fig. 5. First 5 panels display $\nu_{j,k}$ for the first 5 coefficients
from simulation (4.2): red points are iterations r where the descent
direction $k \in \{1, \ldots, 5\}$. Variables 2 and 3 are never selected
due to their excessively large $\nu_{j,k}$ step sizes: an artifact of the
correlation between the 5 variables. The last panel (bottom right) displays
$d_{j,k}$ for those iterations r where $k \in \{1, 4, 5\}$.

Whenever the selected coordinate is from the first group of variables (we
are referring to the red points) one of the coordinates j = 1, 4, 5
achieves a small $\nu_{j,k}$ value. However, coordinates j = 2 and j = 3
maintain very large values throughout all iterations. This is despite the
fact that the two coordinates generally have large values of $d_{j,k}$,
especially during the early iterations (see the bottom right panel). This
suggests that 1, 4 and 5 become active variables at some point in the
solution path, whereas coordinates 2 and 3 are never selected (indeed, this
is exactly what happened). We can conclude that coordinates 2 and 3 are
being strongly repressed by $k \in \{1, 4, 5\}$. Interestingly, coordinate
4 also appears to be repressed at later iterations of the algorithm.
Observe how its $d_{j,k}$ values decrease with increasing r (blue line in
bottom right panel), and that its $\nu_{j,k}$ values are only small at
earlier iterations. Thus, we can also conclude that coordinates {1, 5}
eventually repress coordinate 4 as well.

We note that the number of iterations M* = 500 used in the example is not
very large, and if L2Boosting were run for a longer period of time,
coordinates 2 and 3 would eventually enter the solution path (panels 2 and
3 of Figure 5 show evidence of this already happening with $\nu_{j,k}$
steadily decreasing as r increases). However, doing so leads to overfitting
and poor test-set performance (we provide evidence of this shortly). Using
different values of $\nu$ also did not resolve the problem. Thus, similar
to the lasso, we find that L2Boosting is unable to select entire groups of
correlated variables. Like the lasso, this means it will also perform
suboptimally in highly correlated settings. In the next section we
introduce a simple way of adding L2-regularization to correct this
deficiency.

5. Elastic net boosting. The tendency of the lasso to select only a
handful of variables from among a group of correlated variables was noted
in Zou and Hastie (2005). To address this deficiency, Zou and Hastie (2005)
described an optimization problem different from the classical lasso
framework. Rather than relying only on L1-penalization, they included an
additional L2-regularization parameter designed to encourage a ridge-type
grouping effect, and termed the resulting estimator "the elastic net."
Specifically, for a fixed $\lambda > 0$ (the ridge parameter) and a fixed
$\lambda_0 > 0$ (the lasso parameter), the elastic net was defined as

(5.1)
$$\hat\beta_{\mathrm{enet}} = (1 + \lambda)\arg\min_{\beta \in \mathbb{R}^p}
\left\{\|y - X\beta\|^2 + \lambda\sum_{k=1}^p \beta_k^2 + \lambda_0\sum_{k=1}^p |\beta_k|\right\}.$$

To calculate the elastic net, Zou and Hastie (2005) showed that (5.1) could
be recast as a lasso optimization problem by replacing the original data
with suitably constructed augmented values. They replaced y (n x 1) and
X (n x p) with augmented values y* and X*, defined as follows:

(5.2)
$$y^* = \begin{bmatrix} y \\ 0 \end{bmatrix}_{(n+p) \times 1}, \qquad
X^* = \frac{1}{\sqrt{1+\lambda}}\begin{bmatrix} X \\ \sqrt{\lambda}\,I \end{bmatrix}
= [X_1^*, \ldots, X_p^*]_{(n+p) \times p}.$$

The elastic net optimization can be written in terms of the augmented data
by reparameterizing $\beta$ as $\beta^* = \sqrt{1+\lambda}\,\beta$. By
Lemma 1 of Zou and Hastie (2005), it follows that (5.1) can be expressed as

$$\hat\beta_{\mathrm{enet}} = \sqrt{1+\lambda}\,\arg\min_{\beta^* \in \mathbb{R}^p}
\left\{\|y^* - X^*\beta^*\|^2 + \frac{\lambda_0}{\sqrt{1+\lambda}}\sum_{k=1}^p |\beta_k^*|\right\},$$

which is an L1-optimization problem that can be solved using the lasso.

One explanation for why the elastic net is so successful in correlated
problems is its decorrelation property. Let
$R^*_{j,k} = X_j^{*T}X_k^*$. Because the data is standardized such that
$X_j^TX_j = X_k^TX_k = 1$ [recall (1.3)], we have

$$R^*_{j,k} = \begin{cases}
\dfrac{X_j^TX_k}{1+\lambda} = \dfrac{R_{j,k}}{1+\lambda}, & \text{if } j \ne k,\\[6pt]
\dfrac{X_j^TX_j + \lambda}{1+\lambda} = 1, & \text{if } j = k.
\end{cases}$$

One can see that $\lambda$ is a decorrelation parameter, with larger values
reducing the correlation between coordinates. Zou and Hastie (2005) argued
that this effect promotes a "grouping property" for the elastic net that
overcomes the lasso's inability to select groups of correlated variables.
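The decorrelation effect of the augmentation can be checked directly. The
sketch below builds the augmented data in the Zou and Hastie (2005) form
(original rows stacked on $\sqrt{\lambda}$ times the identity, all scaled
by $1/\sqrt{1+\lambda}$) for two unit-norm columns with correlation 0.9,
and compares the augmented inner products with $R_{j,k}/(1+\lambda)$ and 1.
The function name `augment` and the toy values are assumptions made for
illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def augment(cols, y, lam):
    """Return (cols_star, y_star): the elastic-net style augmented data.

    Column j gains p pseudo-observations, equal to sqrt(lam) in position j
    and 0 elsewhere; every entry is scaled by 1/sqrt(1 + lam). The response
    gains p zeros.
    """
    p = len(cols)
    scale = 1.0 / math.sqrt(1.0 + lam)
    cols_star = []
    for j, col in enumerate(cols):
        pseudo = [math.sqrt(lam) if i == j else 0.0 for i in range(p)]
        cols_star.append([scale * v for v in list(col) + pseudo])
    y_star = list(y) + [0.0] * p
    return cols_star, y_star

r = 0.9
cols = [[1.0, 0.0], [r, math.sqrt(1 - r * r)]]  # unit norm, correlation r
y = [1.0, 2.0]
lam = 0.5

cols_star, y_star = augment(cols, y, lam)
print(dot(cols_star[0], cols_star[1]), r / (1 + lam))  # off-diagonal entry
print(dot(cols_star[0], cols_star[0]))                 # diagonal entry
```

The augmented columns keep unit norm, so running a procedure that assumes
standardized columns on the augmented data requires no further rescaling.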

We believe that decorrelation is an important component of the elastic
net's success. However, we will argue that in addition to its role in
decorrelation, $\lambda$ has a surprising connection to repressibility that
further explains its role in regularizing the elastic net.

The argument for the elastic net follows as a special case (the limit) of
a generalized L2Boosting procedure we refer to as elasticBoost. The
elasticBoost algorithm is a modification of L2Boosting applied to the
augmented problem. To implement elasticBoost one runs L2Boosting on the
augmented data (5.2), adding a post-processing step to rescale the
coefficient solution path: see Algorithm 4 for a precise description. For
arbitrarily small $\nu$, the solution path for elasticBoost approximates
the elastic net, but for general $0 < \nu < 1$, elasticBoost represents a
novel extension of L2Boosting. We study the general elasticBoost algorithm,
for arbitrary $0 < \nu < 1$, and present a detailed explanation of how
$\lambda$ imposes L2-regularization.

Algorithm 4 elasticBoost
1: Augment the data (5.2). Set $F^*_{0,i} = 0$ for i = 1, ..., n + p.
2: Run Algorithm 3 for M iterations using the augmented data.
3: Let $F^*_{M,i}$ denote the M-step predictor (discard $F^*_{M,i}$ for
   i > n). Let $\beta^*_M$ denote the M-step coefficient estimate.
4: Rescale the regression estimates:
   $\beta_{M,k} = \sqrt{1+\lambda}\,\beta^*_{M,k}$.
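The steps of Algorithm 4 can be sketched as follows. Algorithm 3 is not
reproduced in this section, so plain componentwise L2Boosting is used as a
stand-in for step 2; this substitution, the function names, and the toy
data are assumptions made for illustration. Columns are assumed to have
unit norm and the initial predictor is zero.

```python
import math

def l2boost_coef(cols, y, nu, steps):
    """Componentwise L2Boosting (stand-in for Algorithm 3); returns beta."""
    resid = list(y)
    beta = [0.0] * len(cols)
    for _ in range(steps):
        rho = [sum(x * r for x, r in zip(col, resid)) for col in cols]
        best = max(range(len(cols)), key=lambda j: abs(rho[j]))
        beta[best] += nu * rho[best]
        resid = [r - nu * rho[best] * x for r, x in zip(resid, cols[best])]
    return beta

def elastic_boost(cols, y, lam, nu, steps):
    """Sketch of elasticBoost: augment, boost, then rescale."""
    p = len(cols)
    scale = 1.0 / math.sqrt(1.0 + lam)
    # Step 1: augmented columns and response
    cols_star = [
        [scale * v for v in col] +
        [scale * math.sqrt(lam) if i == j else 0.0 for i in range(p)]
        for j, col in enumerate(cols)
    ]
    y_star = list(y) + [0.0] * p
    # Steps 2-3: boost on the augmented data and keep the coefficients
    beta_star = l2boost_coef(cols_star, y_star, nu, steps)
    # Step 4: rescale the coefficient estimates
    return [math.sqrt(1.0 + lam) * b for b in beta_star]

# Toy data: two highly correlated unit-norm columns plus an unrelated one.
r = 0.99
cols = [[1.0, 0.0, 0.0], [r, math.sqrt(1 - r * r), 0.0], [0.0, 0.0, 1.0]]
y = [a + 0.9 * b + 0.1 * c for a, b, c in zip(*cols)]

print(l2boost_coef(cols, y, nu=0.05, steps=50))
print(elastic_boost(cols, y, lam=0.5, nu=0.05, steps=50))
```

Setting `lam=0` adds only zero pseudo-observations and a rescaling factor
of 1, so the sketch then reduces to the plain boosting stand-in.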



5.1. How $\lambda$ regularizes the solution path. To study the effect
$\lambda$ has on elasticBoost's solution path we consider in detail how
$\lambda$ affects $m^*_{j,k}$, the number of steps to favorability
[defined as in (2.3) but with y and X replaced by their augmented values
y* and X*]. At initialization, the gradient-correlation for $j \ne k$ is

$$\rho^*_{j,1} = X_j^{*T}(y^* - F^*_0).$$

In the special case when $F^*_0 = 0$, corresponding to the first descent of
the algorithm,

$$\rho^*_{j,1} = X_j^{*T}y^* = \frac{X_j^Ty}{\sqrt{1+\lambda}} = \frac{\rho_{j,1}}{\sqrt{1+\lambda}}.$$

Therefore, $d^*_{j,k} = \rho_{j,1}/\rho_{k,1} = d_{j,k}$, and hence

$$m^*_{j,k} = \mathrm{floor}\left[1 + \frac{\log|d_{j,k} - R^*_{j,k}|
- \log(1 - R^*_{j,k}\,\mathrm{sgn}(d_{j,k} - R^*_{j,k}))}{\log(1 - \nu)}\right].$$

This equals the number of steps in the original (nonaugmented) problem but
where X is replaced with variables decorrelated by a factor of
$\sqrt{1+\lambda}$. For large values of $\lambda$ this addresses the
problem seen in Figure 5. Recall we argued that $m_{j,k}$ can become
inflated due to the near equality of $d_{j,k}$ with $R_{j,k}$. However,
$R^*_{j,k} = R_{j,k}/\sqrt{1+\lambda}$ shrinks to zero with increasing
$\lambda$, which keeps $m^*_{j,k}$ from becoming inflated.

This provides one explanation for $\lambda$'s role in regularization, at
least for the case when $\lambda$ is large. But we now suggest another
theory that applies for both small and large $\lambda$. We argue that
regularization is imposed not just by decorrelation, but through a
combination of decorrelation and reversal of repressibility. Thus
$\lambda$'s role is more subtle than our previous argument suggests.

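To show this, let us suppose that near-repressibility holds. We assume
therefore that $R_{j,k} = d_{j,k}(1 + \delta)$ for some small
$|\delta| < 1$. Then,

$$m^*_{j,k} = \mathrm{floor}\left[1 + \frac{\log|d_{j,k} - R^*_{j,k}|
- \log(1 - R^*_{j,k}\,\mathrm{sgn}(d_{j,k} - R^*_{j,k}))}{\log(1 - \nu)}\right],$$

where the numerator can be written as

(5.3)
$$\underbrace{\log|d_{j,k}| + \log\left|1 - \frac{1+\delta}{\sqrt{1+\lambda}}\right|}_{\text{Repressibility effect}}
\;-\; \underbrace{\log\left(1 - \frac{R_{j,k}}{\sqrt{1+\lambda}}\,
\mathrm{sgn}\left[R_{j,k}\left(\frac{1}{1+\delta} - \frac{1}{\sqrt{1+\lambda}}\right)\right]\right)}_{\text{Decorrelation effect}}.$$
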

The first term on the right captures the effect of repressibility. When
$\delta$ is small, $\lambda$ plays a crucial role in controlling its size.
If $\lambda = 0$, the expression reduces to $\log|d_{j,k}| + \log|\delta|$
which converges to $-\infty$ as $|\delta| \to 0$, thus precluding j from
being selected [keep in mind that (5.3) is divided by $\log(1 - \nu)$,
which is negative; thus $m^*_{j,k} \to \infty$]. On the other hand, any
$\lambda > 0$, even a relatively small value, ensures that the expression
remains small even for arbitrarily small $\delta$, thus reversing the
effect of repressibility.

The second term on the right of (5.3) is related to decorrelation. If
$1 + \lambda > (1 + \delta)^2$ (which holds if $\lambda$ is large enough
when $\delta > 0$, or for all $\lambda > 0$ if $\delta < 0$), the term
reduces to

$$\log\left(1 - \frac{R_{j,k}}{\sqrt{1+\lambda}}\,\mathrm{sgn}(R_{j,k})\right),$$

which remains bounded when $\lambda > 0$ if $R_{j,k} \to 1$. On the other
hand, if $1 + \lambda < (1 + \delta)^2$, the term reduces to

$$\log\left(1 + \frac{R_{j,k}}{\sqrt{1+\lambda}}\,\mathrm{sgn}(R_{j,k})\right),$$

which remains bounded if $R_{j,k} \to 1$ and shrinks in absolute size as
$\lambda$ increases.

Taken together, these arguments show $\lambda$ imposes L2-regularization
through a combination of decorrelation and the reversal of repressibility,
which applies even when $\lambda$ is relatively small.

These arguments apply to the first descent. The general case when
$F^*_0 \ne 0$ requires a detailed analysis of $d^*_{j,k}$. In general,

$$d^*_{j,k} = \frac{X_j^Ty - \sum_{i=1}^n x_{i,j}F^*_{0,i} - \sqrt{\lambda}\,F^*_{0,n+j}}
{X_k^Ty - \sum_{i=1}^n x_{i,k}F^*_{0,i} - \sqrt{\lambda}\,F^*_{0,n+k}}.$$

We break up the analysis into two cases depending on the size of
$\lambda$. Suppose first that $\lambda$ is small. Then

$$d^*_{j,k} \approx \frac{X_j^Ty - \sum_{i=1}^n x_{i,j}F^*_{0,i}}
{X_k^Ty - \sum_{i=1}^n x_{i,k}F^*_{0,i}},$$

which is the ratio of gradient correlations based on the original X without
pseudo-data. If j is a promising variable, then $d^*_{j,k}$ will be
relatively large, and our argument from above applies. On the other hand if
$\lambda$ is large, then the third term in the numerator and the
denominator of $d^*_{j,k}$ become the dominating terms and

$$d^*_{j,k} \approx \frac{F^*_{0,n+j}}{F^*_{0,n+k}}.$$

The growth rate of $F^*_{0,n+j}$ for the pseudo data is $O(\nu)$ for a
group of variables that are actively being explored by the algorithm. Thus
$|d^*_{j,k}| \asymp 1$ and our previous argument applies.

5.2. Illustration. As evidence of this, and to demonstrate the
effectiveness of elasticBoost, we re-analyzed (4.2) using Algorithm 4. We
used the same parameters as in Figure 5 (M* = 500 and $\nu = 0.05$). We set
$\lambda = 0.5$. The results are displayed in Figure 6. In contrast to
Figure 5, notice that all 5 of the first group of correlated variables
achieve small $\nu^*_{j,k}$ values (and we confirmed that all 5 variables
enter the solution path). It is interesting to note that $d^*_{j,k}$ is
nearly 1 for each of these variables.

Fig. 6. elasticBoost applied to simulation (4.2) (plots are constructed as
in Figure 5). Now each of the first 5 coordinates is selected and each has
$d^*_{j,k}$ values near one.

To compare L2Boosting and elasticBoost more evenly, we used 10-fold
cross-validation to determine the optimal number of iterations (for
elasticBoost, we used doubly-optimized cross-validation to determine both
the optimal number of iterations and the optimal $\lambda$ value; the
latter was found to equal $\lambda = 0.1$). Figure 7 displays the results.
The top row displays L2Boosting, while the bottom row is elasticBoost (fit
under the optimized $\lambda$). The minimum mean-squared-error (MSE) is
slightly smaller for elasticBoost (217.9) than L2Boosting (231.7) (first
panels in top and bottom rows). Curiously, the MSE is minimized using about
the same number of iterations for both methods (190 for L2Boosting and 169
for elasticBoost). The middle panels display the coefficient paths. The
vertical blue line indicates the MSE optimized number of iterations. In the
case of L2Boosting only 4 nonzero coefficients are identified within the
optimal number of steps, whereas elasticBoost finds all 15 nonzero
coefficients. This can be seen more clearly in the right panels which show
coefficient estimates at the optimized stopping time. Not only are all 15
nonzero coefficients identified by elasticBoost, but their estimated
coefficient values are all roughly near the true value of 3. In contrast,
L2Boosting finds only 4 coefficients due to strong repressibility. Its
coefficient estimates are also wildly inaccurate. While this does not
overly degrade prediction error performance (as evidenced by the first
panel), variable selection performance is seriously impacted.

Fig. 7. L2Boosting (top row) versus elasticBoost (bottom row) from
simulation (4.2).

The entire experiment was then repeated 250 times using 250 independent
learning sets. Figure 8 displays the coefficient estimates from these 250
experiments for elasticBoost (left side) and L2Boosting (right side) as
boxplots. The top panels are based on the original sample size of n = 100
and the bottom panels use a larger sample size n = 1000. The results
confirm our previous finding: elasticBoost is consistently able to group
variables and outperform L2Boosting in terms of variable selection.

Fig. 8. elasticBoost (left) versus L2Boosting (right) from simulation (4.2)
for n = 100 (top) and n = 1000 (bottom) based on 250 independent learning
samples. The distributions of coefficient estimates are displayed as
boxplots; mean values are given in red.

Finally, the left panel of Figure 9 displays the difference in test set MSE
for L2Boosting and elasticBoost as a function of $\lambda$ over the 250
experiments (n = 100). Negative values indicate a lower MSE for
elasticBoost, which is generally the case for larger $\lambda$. The right
panel displays the MSE optimized number of iterations for L2Boosting
compared to elasticBoost. Generally, elasticBoost requires fewer steps as
$\lambda$ increases. This is interesting, because as pointed out, this
generally coincides with better MSE performance.

Fig. 9. Left: difference in test set performance of L2Boosting compared to
elasticBoost. Right: difference in MSE optimized number of iterations for
L2Boosting compared to elasticBoost.

6. Discussion. A key observation is that L2Boosting's behavior along a
fixed descent direction is fully specified with the exception of the
descent length, $L_r$. In Theorem 2, we described a closed form solution
for $m_{j,k}$, the number of steps until favorability, where $k = l_r$ is
the currently selected coordinate direction and $j = l_{r+1}$ is the next
most favorable direction. Theorem 2 quantifies L2Boosting's descent length,
thus allowing us to characterize its solution path as a series of fixed
descents where the next coordinate direction, chosen from all candidates
$j \ne k$, is determined as that with the minimal descent length $m_{j,k}$
(assuming no ties). Since we choose from among all directions $j \ne k$,
$m_{j,k}$, or equivalently the step length $\nu_{j,k}$, can be
characterized as measures to favorability, a property of each coordinate at
any iteration r. These measures are a function of $\nu$ and the ratio of
gradient-correlations $d_{j,k}$ and the correlation coefficient $R_{j,k}$
relative to the currently selected direction k.

Characterizing the L2Boosting solution path by $m_{j,k}$ provides
considerable insight when examining the limiting conditions. When
$m_{j,k} \to 1$, L2Boosting exhibits active set cycling, a property
explored in detail in Section 3. We note that this condition is
fundamentally a result of the optimization method which drives
$|d_{j,k}| \to 1$ when $\nu$ is arbitrarily small. This virtually
guarantees the notorious slow convergence seen with infinitesimal forward
stagewise algorithms.

The repressibility condition occurs in the alternative limiting condition
$m_{j,k} \to \infty$. Repressibility arises when the gradient correlation
ratio $d_{j,k}$ equals the correlation $R_{j,k}$. When $|R_{j,k}| < 1$, j
is said to be strongly repressed by k, and while descending along k, the
absolute gradient-correlation for j can never be equal to or surpass the
absolute gradient-correlation for k. Strong repressibility plays a crucial
role in correlated settings, hindering variables from being actively
selected. Adding L2 regularization reverses repressibility and
substantially improves variable selection for elasticBoost, an L2Boosting
implementation involving the data augmentation framework used by the
elastic net.

SUPPLEMENTARY MATERIAL 

Proofs of results from "Characterizing L2Boosting"
(DOI: 10.1214/12-AOS997SUPP; .pdf). An online supplementary file contains
the detailed proofs for Theorems 1 through 9. These proofs make use of
various notation described in the paper.